Data documentation


Datasheets for AI and medical datasets (DAIMS): a data validation and documentation framework before machine learning analysis in medical research

Marandi, Ramtin Zargari, Frahm, Anne Svane, Milojevic, Maja

arXiv.org Artificial Intelligence

Despite progress in data engineering, data validation and documentation procedures remain inconsistent, causing confusion and technical problems in research involving machine learning. Frameworks such as "Datasheets for Datasets" have been a step forward, but there is still room for improvement in preparing datasets so that they are ready for ML pipelines. Here, we extend that framework to "Datasheets for AI and medical datasets" (DAIMS). Our publicly available solution, DAIMS, provides a checklist of data standardization requirements, a software tool to assist with data preparation, an extended form for documenting data and posing research questions, a data dictionary table, and a flowchart suggesting ML analyses to address the research questions. The checklist consists of 24 common data standardization requirements, a subset of which the tool checks and validates. In addition, we provide a flowchart mapping research questions to suggested ML methods. DAIMS can serve as a reference for standardizing datasets and a roadmap for researchers aiming to apply effective ML techniques in their medical research. DAIMS is available on GitHub and as an online app that automates key aspects of dataset evaluation, facilitating efficient preparation of datasets for ML studies.
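
The abstract does not enumerate its 24 requirements, but the sketch below illustrates the general kind of automated, pre-analysis check such a validation tool might run on a tabular medical dataset (missing values, duplicate rows, constant columns, column naming). This is a minimal illustration, not the actual DAIMS implementation; the function name, file name, and specific checks are assumptions.

```python
# Illustrative dataset-validation sketch (not the DAIMS tool itself):
# a few generic standardization checks run before ML analysis.
import pandas as pd


def basic_dataset_checks(df: pd.DataFrame) -> dict:
    """Return a small report of common data-quality issues in a tabular dataset."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        # Fraction of missing values per column, only for columns that have any.
        "columns_with_missing": {
            col: float(df[col].isna().mean())
            for col in df.columns
            if df[col].isna().any()
        },
        # Columns carrying no information (a single value or all missing).
        "constant_columns": [
            col for col in df.columns if df[col].nunique(dropna=True) <= 1
        ],
        # Columns whose names deviate from a simple snake_case convention.
        "non_snake_case_columns": [
            col for col in df.columns
            if col != col.strip().lower().replace(" ", "_")
        ],
    }


if __name__ == "__main__":
    # Hypothetical file name; substitute your own dataset.
    df = pd.read_csv("patients.csv")
    print(basic_dataset_checks(df))
```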


A Standardized Machine-readable Dataset Documentation Format for Responsible AI

Jain, Nitisha, Akhtar, Mubashara, Giner-Miguelez, Joan, Shinde, Rajat, Vanschoren, Joaquin, Vogler, Steffen, Goswami, Sujata, Rao, Yuhan, Santos, Tim, Oala, Luis, Karamousadakis, Michalis, Maskey, Manil, Marcenac, Pierre, Conforti, Costanza, Kuchnik, Michael, Aroyo, Lora, Benjelloun, Omar, Simperl, Elena

arXiv.org Artificial Intelligence

Data is critical to advancing AI technologies, yet its quality and documentation remain significant challenges, leading to adverse downstream effects (e.g., potential biases) in AI applications. This paper addresses these issues by introducing Croissant-RAI, a machine-readable metadata format designed to enhance the discoverability, interoperability, and trustworthiness of AI datasets. Croissant-RAI extends the Croissant metadata format and builds upon existing responsible AI (RAI) documentation frameworks, offering a standardized set of attributes and practices to facilitate community-wide adoption. Leveraging established web-publishing practices, such as Schema.org, Croissant-RAI enables dataset users to easily find and utilize RAI metadata regardless of the platform on which the datasets are published. Furthermore, it is seamlessly integrated into major data search engines, repositories, and machine learning frameworks, streamlining the reading and writing of responsible AI metadata within practitioners' existing workflows. Croissant-RAI was developed through a community-led effort. It has been designed to be adaptable to evolving documentation requirements and is supported by a Python library and a visual editor.
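
Because Croissant-RAI builds on Schema.org-style JSON-LD, a dataset's responsible-AI metadata can be published as an ordinary machine-readable record alongside the data. The sketch below is only indicative: the dataset, the namespace URI, and the `rai:`-prefixed property names are assumptions for illustration; the authoritative vocabulary is defined by the Croissant-RAI specification and its Python library.

```python
# Illustrative Croissant-style JSON-LD record carrying responsible-AI metadata.
# The "rai:" property names and namespace URI below are indicative only; consult
# the Croissant-RAI specification for the exact vocabulary.
import json

record = {
    "@context": {
        "@vocab": "https://schema.org/",
        "rai": "http://mlcommons.org/croissant/RAI/",  # assumed namespace URI
    },
    "@type": "Dataset",
    "name": "example-clinical-notes",  # hypothetical dataset
    "description": "De-identified clinical notes used for triage prediction.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "rai:dataCollection": "Notes exported from a single hospital EHR system in 2022.",
    "rai:dataLimitations": "Single-site data; may not generalize to other hospitals.",
    "rai:personalSensitiveInformation": "Direct identifiers removed before release.",
}

# Serializing to JSON-LD lets search engines, repositories, and ML frameworks
# read the RAI metadata via standard web-publishing practice (Schema.org markup).
print(json.dumps(record, indent=2))
```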


Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata

Heger, Amy K., Marquis, Liz B., Vorvoreanu, Mihaela, Wallach, Hanna, Vaughan, Jennifer Wortman

arXiv.org Artificial Intelligence

Data is central to the development and evaluation of machine learning (ML) models. However, the use of problematic or inappropriate datasets can result in harms when the resulting models are deployed. To encourage responsible AI practice through more deliberate reflection on datasets and transparency around the processes by which they are created, researchers and practitioners have begun to advocate for increased data documentation and have proposed several data documentation frameworks. However, there is little research on whether these data documentation frameworks meet the needs of ML practitioners, who both create and consume datasets. To address this gap, we set out to understand ML practitioners' data documentation perceptions, needs, challenges, and desiderata, with the goal of deriving design requirements that can inform future data documentation frameworks. We conducted a series of semi-structured interviews with 14 ML practitioners at a single large, international technology company. We had them answer a list of questions taken from Datasheets for Datasets (Gebru et al., 2021). Our findings show that current approaches to data documentation are largely ad hoc and myopic in nature. Participants expressed needs for data documentation frameworks to be adaptable to their contexts, integrated into their existing tools and workflows, and automated wherever possible. Although data documentation frameworks are often motivated from the perspective of responsible AI, participants did not make the connection between the questions that they were asked to answer and their responsible AI implications. In addition, participants often had difficulties prioritizing the needs of dataset consumers and providing information that someone unfamiliar with their datasets might need to know. Based on these findings, we derive seven design requirements for future data documentation frameworks.


The Importance of a Proper Data Culture

#artificialintelligence

Getting started with AI requires a proper data culture. AI is not magic, despite what many may still think. Before even thinking of AI, the data needs to be in order: you need documentation, policies, and, most importantly, a proper data culture. This is the first in a series of interviews with practitioners in the field about generating business value with AI.